Contents

  1. Correction with Known \(\mu\)
  2. Separating Signals
  3. Empirical Coverage Proportions
  4. Possible Correction: Logit Proportions
  5. Using Reference Panel for Correction
  6. Next Steps and Questions

1. Correction with Known \(\mu\)



UK10K Data

Number of simulations making up each bin:

## 
##    (0,1]    (1,2]  (2,2.5]  (2.5,3]    (3,4] (4,4.75] (4.75,6]    (6,8] 
##      262      453      264      310      665      436      665      666 
##   (8,13]  (13,16] (16,Inf] 
##      680      254      270
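As a minimal sketch, binning like this can be reproduced with `cut()`, using the break points visible in the bin labels above; the \(\mu\) values here are simulated purely for illustration, not the actual simulation output:

```r
# Break points taken from the bin labels above
breaks <- c(0, 1, 2, 2.5, 3, 4, 4.75, 6, 8, 13, 16, Inf)

# Illustrative mu values only -- not the real simulated data
set.seed(1)
mu <- rexp(1000, rate = 0.2)
table(cut(mu, breaks = breaks))
```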

1000 Genomes Data

Now for the high/medium/low LD regions using 1000 Genomes data:

  • The number of simulations making up each bin:
##         
##          (0,1] (1,2] (2,2.5] (2.5,3] (3,4] (4,4.75] (4.75,6] (6,8] (8,13]
##   low       35   299     126      92   215      168      200   158    223
##   high      23   265     171      68   205      153      187   176    259
##   medium    28   257     181      94   189      189      189   169    250
##         
##          (13,16] (16,Inf]
##   low         44       94
##   high        33      104
##   medium      32      115

2. Separating Signals


Problem: Our correction method is not as accurate in regions of high LD.

Ideas:


##     mu_obs    muhat   true   claimed corrected
## 1 18.66619 18.65872 0.9812 0.9574063 0.9224276


Effect of LD and effect size on separability


Let’s now consider the effect of LD and effect size on the separability (i.e. how well the true signal can be picked out).

Possible measures of LD:

  • Max \(r^2\) between the CV and any other SNP (maxr2)
  • Average \(r^2\) value of CV with all other SNPs (meanr2)
  • Median \(r^2\) value of CV with all other SNPs (medianr2)
  • Number of SNPs in LD with the CV, i.e. with \(r^2\) above some threshold (e.g. r20.9 for \(r^2>0.9\); similarly r20.2, r20.1 etc.)
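These measures could be computed as in the following sketch, assuming `r2` holds the \(r^2\) values between the CV and every other SNP in the region (the function name and `threshold` argument are my own):

```r
ld_measures <- function(r2, threshold = 0.2) {
  c(maxr2    = max(r2),               # max r^2 between the CV and any other SNP
    meanr2   = mean(r2),              # average r^2 with all other SNPs
    medianr2 = median(r2),            # median r^2 with all other SNPs
    nLD      = sum(r2 > threshold))   # e.g. "r20.2" when threshold = 0.2
}

set.seed(1)
r2 <- runif(200)   # illustrative r^2 values only
ld_measures(r2)
```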

Possible measures of seperability:

  • The number of variants in the credible set (nvar_cs)
  • The PP at the CV (will be larger if the CV has been well separated) (PP_iCV)

To visualise this, I plot \(\mu\) against the LD for many simulations, with contours added reflecting the seperability.


I then take vertical slices of the plots on the right hand side (for low, medium and high LD) to investigate the relationship between the value of \(\mu\) and the seperability of the signals. We are looking to see if there is a way we can model this relationship.



  • It is perhaps better if we look at the logit of the seperability.

  • For this I focus on the “max \(r^2\) with the CV” LD measure.

Method:

  1. Add logit(PP_iCV) column to original data (removing any Inf rows)
  2. Fit a loess model: loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)
  3. Generate a sequence of predictor values (for maxr2 and true.mu)
  4. Predict the corresponding values of logit_PP_iCV
  5. Convert results to long data table format (melt function)
  6. Select rows of results corresponding to LD bin (i.e. 0.3<maxr2<0.35)
  7. Plot logit_PP_iCV against true.mu
  • This looks promising as the relationship now looks almost linear (note the differing axis values - not the \(y=x\) line).
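The steps above can be sketched as follows. The data frame here is simulated because the real simulation output is not shown; only the column names (`PP_iCV`, `maxr2`, `true.mu`) follow the text. Step 5's melt is unnecessary in this sketch since `expand.grid()` already produces long format:

```r
logit <- function(x) log(x / (1 - x))

# Illustrative data only -- PP_iCV made to increase with mu, decrease with LD
set.seed(1)
n <- 500
data <- data.frame(maxr2 = runif(n), true.mu = runif(n, 1, 16))
data$PP_iCV <- plogis(0.3 * data$true.mu - 2 * data$maxr2 + rnorm(n, 0, 0.5))

# 1. add logit(PP_iCV), dropping any infinite rows
data$logit_PP_iCV <- logit(data$PP_iCV)
data <- data[is.finite(data$logit_PP_iCV), ]

# 2. fit the loess surface
fit <- loess(logit_PP_iCV ~ maxr2 * true.mu, data = data)

# 3-4. predict over a grid of maxr2 and true.mu values
grid <- expand.grid(maxr2   = seq(0.02, 0.98, 0.02),
                    true.mu = seq(1.5, 15.5, 0.5))
grid$logit_PP_iCV <- predict(fit, newdata = grid)

# 6. select the rows for one LD bin (e.g. 0.3 < maxr2 < 0.35)
bin <- subset(grid, maxr2 > 0.3 & maxr2 < 0.35)

# 7. plot logit_PP_iCV against true.mu
plot(bin$true.mu, bin$logit_PP_iCV, type = "p",
     xlab = "true.mu", ylab = "logit(PP_iCV)")
```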


3. Empirical Coverage Proportions




4. Possible Correction: Logit Proportions


Explanation: In our correction, we are averaging something that has an upper bound of 1 (the empirical coverage proportions), where lots of the data points lie close to the upper bound. This means that it is “easier” for the empirical coverage proportions to be too low than too high, and lower estimates are more detrimental than higher estimates. This may explain why our corrected coverage estimate is consistently too low in high powered scenarios.

Potential fix would be to take the logit of the prop_cov quantities used in the final corrected coverage estimate weighting, so that they are no longer bounded.

Our final corrected coverage estimate is then given by:

logit <- function(x) log(x/(1-x))
invlogit <- function(x) exp(x)/(1 + exp(x))

# weight the logit-transformed proportions by pp0, then map back to (0, 1)
corrcov_tmp <- sum(pp0 * logit(prop_cov))
corrcov <- invlogit(corrcov_tmp)
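As a toy numerical check (the pp0 weights and prop_cov values below are made up), averaging on the logit scale pulls the estimate upwards when the proportions sit near 1, which is the direction needed to counteract the downward bias described above:

```r
logit <- function(x) log(x / (1 - x))
invlogit <- function(x) exp(x) / (1 + exp(x))

pp0      <- c(0.2, 0.3, 0.5)      # illustrative weights summing to 1
prop_cov <- c(0.93, 0.96, 0.99)   # illustrative coverage proportions

direct  <- sum(pp0 * prop_cov)                    # bounded-scale average
corrcov <- invlogit(sum(pp0 * logit(prop_cov)))   # logit-scale average

round(c(direct = direct, logit_scale = corrcov), 4)
```

Since logit is convex on \((0.5, 1)\), the logit-scale average is at least as large as the direct average whenever the proportions lie near the upper bound.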


5. Using Reference Panel for Correction



Method:

  1. Select 200 SNPs from UK10K data
  2. Find the corresponding SNPs (by their genomic position) in the 1000 Genomes data
  3. Remove SNPs from the analysis if they cannot be found or if they map to multiple rows in the 1000 genomes data (removes ~2-5 SNPs on average)
  4. Find the haplotypes for the remaining SNPs in the UK10K data and use simGWAS to generate the system to be corrected
  5. Find the corrected coverage using the “true LD” from the UK10K haplotypes
  6. Find the corrected coverage using the LD estimated from the 1000 Genomes reference panel
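Steps 2-3 might look like the following sketch; the positions and object names here are invented for illustration:

```r
# Toy positions: 205 maps to two rows in the panel; 307 and 512 are absent
uk10k_pos <- c(101, 205, 307, 410, 512)
panel_pos <- c(101, 205, 205, 410, 600)

counts <- table(panel_pos)
unique_in_panel <- as.numeric(names(counts)[counts == 1])

keep <- uk10k_pos %in% unique_in_panel   # drop missing / multi-mapping SNPs
uk10k_pos[keep]
```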


The number of simulations in each bin:

## 
##    (0,1]    (1,2]  (2,2.5]  (2.5,3]    (3,4] (4,4.75] (4.75,6]    (6,8] 
##      237      390      259      277      565      461      615      575 
##   (8,13]  (13,16] (16,Inf] 
##      665      238      258

Side note: Why does the standard corrected coverage estimate not look as good as usual? Perhaps because these simulations are smaller, or because we are trimming the SNPs that are not in the 1000 Genomes data?


6. Next Steps and Questions